cache miss


Impact of Data-Oriented and Object-Oriented Design on Performance and Cache Utilization with Artificial Intelligence Algorithms in Multi-Threaded CPUs

Arantes, Gabriel M., Pinto, Richard F., Dalmazo, Bruno L., Borges, Eduardo N., Lucca, Giancarlo, de Mattos, Viviane L. D., Cardoso, Fabian C., Berri, Rafael A.

arXiv.org Artificial Intelligence

This study provides a comprehensive performance analysis of Data-Oriented Design (DOD) versus the traditional Object-Oriented Design (OOD), focusing on cache utilization and efficiency in multi-threaded environments. We developed and compared four distinct versions of the A* search algorithm: single-threaded OOD (ST-OOD), single-threaded DOD (ST-DOD), multi-threaded OOD (MT-OOD), and multi-threaded DOD (MT-DOD). The evaluation was based on metrics including execution time, memory usage, and CPU cache misses. In multi-threaded tests, the DOD implementation demonstrated considerable performance gains, with faster execution times and a lower number of raw system calls and cache misses. While OOD occasionally showed marginal advantages in memory usage or percentage-based cache miss rates, DOD's efficiency in data-intensive operations was more evident. Furthermore, our findings reveal that for a fine-grained task like the A* algorithm, the overhead associated with thread management led to single-threaded versions significantly outperforming their multi-threaded counterparts in both paradigms. We conclude that even when performance differences appear subtle in simple algorithms, the consistent advantages of DOD in critical metrics highlight its foundational architectural superiority, suggesting it is a more effective approach for maximizing hardware efficiency in complex, large-scale AI and parallel computing tasks.
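The two paradigms under comparison can be contrasted in miniature: OOD groups all of a node's fields into one object (an array-of-structs layout), while DOD keeps each field in its own contiguous array (struct-of-arrays), which tends to improve cache locality when a loop touches only one or two fields. A minimal illustrative sketch, not the paper's implementation:

```python
# Illustrative only: array-of-structs (OOD-style) vs struct-of-arrays
# (DOD-style) layouts for A*-like node data.

class Node:
    """OOD-style: one object per search node."""
    def __init__(self, g, h):
        self.g = g  # cost from the start node
        self.h = h  # heuristic estimate to the goal

def total_cost_ood(nodes):
    # Walks a list of heap-allocated objects even though only
    # two fields per node are needed.
    return [n.g + n.h for n in nodes]

def total_cost_dod(g_costs, h_costs):
    # DOD-style: the same fields live in parallel, contiguous arrays,
    # so the traversal streams through memory sequentially.
    return [g + h for g, h in zip(g_costs, h_costs)]

nodes = [Node(i, 10 - i) for i in range(10)]
g = [n.g for n in nodes]
h = [n.h for n in nodes]
assert total_cost_ood(nodes) == total_cost_dod(g, h) == [10] * 10
```

Both layouts compute identical results; the difference the paper measures is purely in how the hardware prefetcher and cache lines are utilized.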


Robustifying Learning-Augmented Caching Efficiently without Compromising 1-Consistency

Chen, Peng, Zhao, Hailiang, Zhang, Jiaji, Tang, Xueyan, Wang, Yixuan, Deng, Shuiguang

arXiv.org Artificial Intelligence

The online caching problem aims to minimize cache misses when serving a sequence of requests under a limited cache size. While naive learning-augmented caching algorithms achieve ideal $1$-consistency, they lack robustness guarantees. Existing robustification methods either sacrifice $1$-consistency or introduce excessive computational overhead. In this paper, we introduce Guard, a lightweight robustification framework that enhances the robustness of a broad class of learning-augmented caching algorithms to $2H_{k-1} + 2$, while preserving their $1$-consistency. Guard achieves the current best-known trade-off between consistency and robustness, with only O(1) additional per-request overhead, thereby maintaining the original time complexity of the base algorithm. Extensive experiments across multiple real-world datasets and prediction models validate the effectiveness of Guard in practice.
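The setting can be illustrated with a tiny simulator: a learning-augmented cache evicts whatever page a predictor suggests, so a perfect (clairvoyant) predictor yields Bélády-optimal behavior, which is what 1-consistency means, while a bad predictor can be arbitrarily worse, which is what robustification frameworks bound. A hedged sketch of the setting only; Guard's actual mechanism and guarantees are in the paper:

```python
from collections import OrderedDict

def serve(requests, predict_evict, capacity):
    """Tiny learning-augmented cache: on a miss with a full cache,
    evict the page the predictor suggests."""
    cache = OrderedDict()
    misses = 0
    for t, page in enumerate(requests):
        if page in cache:
            cache.move_to_end(page)  # track recency
            continue
        misses += 1
        if len(cache) == capacity:
            cache.pop(predict_evict(list(cache), requests, t))
        cache[page] = True
    return misses

def clairvoyant(cached, requests, t):
    """Perfect predictor (Bélády's rule): evict the cached page
    whose next use lies farthest in the future."""
    def next_use(p):
        for i in range(t + 1, len(requests)):
            if requests[i] == p:
                return i
        return float("inf")
    return max(cached, key=next_use)

# With a perfect predictor the cache is optimal on this trace
# (4 misses, versus 6 for plain LRU with capacity 2).
assert serve([1, 2, 3, 1, 2, 3], clairvoyant, capacity=2) == 4
```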


Appendix of A Deep Learning Dataloader with Shared Data Preparation

Neural Information Processing Systems

In this part, we show the I/O speed in the synchronous and asynchronous cases. Figure 3a shows the I/O speed for four jobs that start at different moments. We then compare RefCnt with the generic cache policy in the above cases. D = sample([0, 13333], 10000) means sampling a subset D of size 10000 from [0, 13333] uniformly at random. DSA always achieves the minimum number of misses. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).


Cache Management for Mixture-of-Experts LLMs -- extended version

Angelopoulos, Spyros, Marchal, Loris, Obrecht, Adrien, Simon, Bertrand

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated remarkable capabilities across a variety of tasks. One of the main challenges towards the successful deployment of LLMs is memory management, since they typically involve billions of parameters. To this end, architectures based on Mixture-of-Experts have been proposed, which aim to reduce the size of the parameters that are activated when producing a token. This raises the equally critical issue of efficiently managing the limited cache of the system, in that frequently used experts should be stored in the fast cache rather than in the slower secondary memory. In this work, we introduce and study a new paging problem that models expert management optimization. Our formulation captures both the layered architecture of LLMs and the requirement that experts are cached efficiently. We first present lower bounds on the competitive ratio of both deterministic and randomized algorithms, which show that under mild assumptions, LRU-like policies have good theoretical competitive performance. We then propose a layer-based extension of LRU that is tailored to the problem at hand. Extensive simulations on both synthetic datasets and actual traces of MoE usage show that our algorithm outperforms policies for the classic paging problem, such as the standard LRU.
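The paging formulation can be made concrete with a toy trace: each forward pass emits (layer, expert) requests, and a cache of limited size decides which experts' weights stay resident. A plain-LRU baseline over such keys (not the paper's layer-based variant) looks like:

```python
from collections import OrderedDict

def lru_expert_cache(trace, capacity):
    """Baseline LRU over (layer, expert) keys for an MoE trace.
    Each request loads an expert's weights if not already cached."""
    cache = OrderedDict()
    misses = 0
    for key in trace:  # key = (layer_index, expert_index)
        if key in cache:
            cache.move_to_end(key)         # refresh recency
        else:
            misses += 1                    # fetch from slow memory
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return misses

# Two tokens routed through a 2-layer, 2-experts-per-layer model:
trace = [(0, 0), (1, 1), (0, 0), (1, 0)]
assert lru_expert_cache(trace, capacity=2) == 3
```

The paper's layer-based extension additionally exploits the fact that requests arrive layer by layer, structure that this plain baseline ignores.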



NVR: Vector Runahead on NPUs for Sparse Memory Access

Wang, Hui, Zhao, Zhengpeng, Wang, Jing, Du, Yushu, Cheng, Yuan, Guo, Bing, Xiao, He, Ma, Chenhao, Han, Xiaomeng, You, Dean, Guan, Jiapeng, Wei, Ran, Yang, Dawei, Jiang, Zhe

arXiv.org Artificial Intelligence

Deep Neural Networks are increasingly leveraging sparsity to curb the growth of model parameter size. However, reducing wall-clock time through sparsity and pruning remains challenging due to irregular memory access patterns, which lead to frequent cache misses. In this paper, we present NPU Vector Runahead (NVR), a prefetching mechanism tailored for NPUs to address cache miss problems in sparse DNN workloads. NVR provides a general micro-architectural solution for sparse DNN workloads without requiring compiler or algorithmic support, operating as a decoupled, speculative, lightweight hardware sub-thread alongside the NPU, with minimal hardware overhead (under 5%). NVR achieves an average 90% reduction in cache misses compared to SOTA prefetching in general-purpose processors, delivering a 4x average speedup on sparse workloads versus NPUs without prefetching. Moreover, we investigate the advantages of incorporating a small cache (16KB) into the NPU combined with NVR. Our evaluation shows that expanding this modest cache delivers 5x higher performance benefits than increasing the L2 cache size by the same amount. Fortunately, these workloads are typically over-parameterised [3]: up to 90% of parameters in prevalent models can be pruned while maintaining comparable performance [4]. This redundancy presents an opportunity to leverage sparsity to reduce such intensive resource demands. Theoretically, more fine-grained sparsity patterns yield higher acceleration by skipping more zero-valued elements.


Auditing Prompt Caching in Language Model APIs

Gu, Chenchen, Li, Xiang Lisa, Kuditipudi, Rohith, Liang, Percy, Hashimoto, Tatsunori

arXiv.org Artificial Intelligence

Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users' prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users' prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
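The audit idea reduces to a timing comparison: send each prompt twice and test whether repeat requests are systematically faster than first requests. A self-contained sketch against a simulated API; all latency numbers here are hypothetical stand-ins, not measurements of any real provider:

```python
import random
import statistics

def simulated_api(prompt, cache, base=0.200, cache_speedup=0.120,
                  jitter=0.010):
    """Stand-in for a real LLM API (hypothetical latencies, in seconds):
    cached prompts skip prefill and respond faster."""
    hit = prompt in cache
    cache.add(prompt)
    mean = base - (cache_speedup if hit else 0.0)
    return random.gauss(mean, jitter)

def audit(n=50, seed=0):
    """Timing audit: compare first-request (cold) vs repeat (warm)
    latencies for the same prompt across many fresh prompts."""
    random.seed(seed)
    cache = set()
    cold, warm = [], []
    for i in range(n):
        p = f"prompt-{i}"
        cold.append(simulated_api(p, cache))  # first send: cache miss
        warm.append(simulated_api(p, cache))  # repeat: cache hit
    return statistics.median(cold) - statistics.median(warm)

gap = audit()
assert gap > 0.05  # warm requests measurably faster: caching detected
```

The paper's real audits replace this median comparison with proper statistical hypothesis tests against live APIs; the structure of the experiment is the same.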


Accelerating the k-means++ Algorithm by Using Geometric Information

Corominas, Guillem Rodríguez, Blesa, Maria J., Blum, Christian

arXiv.org Artificial Intelligence

k-means clustering is a widely used method in data clustering and unsupervised machine learning, aiming to divide a given dataset into k distinct, non-overlapping clusters. This division seeks to minimize the within-cluster variance. The k-means clustering problem becomes NP-hard when extended beyond a single dimension [3]. Despite this complexity, there are algorithms designed to find sufficiently good solutions within a reasonable amount of time. Among these, Lloyd's algorithm, also referred to as the standard algorithm or batch k-means, is the most renowned [42]. The k-means algorithm is one of the most popular algorithms in data mining [58, 32], mainly due to its simplicity, scalability, and guaranteed termination. However, its performance is highly sensitive to the initial placement of the centers [5]. In fact, there is no general approximation guarantee for Lloyd's algorithm that applies to all scenarios, i.e., an arbitrary initialization may lead to an arbitrarily bad clustering. Therefore, it is crucial to employ effective initialization methods [24].
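The standard effective initialization is k-means++ seeding: pick the first center uniformly at random, then pick each subsequent center with probability proportional to its squared distance to the nearest center already chosen (D² sampling). A minimal 1-D sketch:

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """k-means++ seeding (D^2 sampling) for 1-D points: each new
    center is drawn with probability proportional to the squared
    distance to the nearest already-chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]          # first center: uniform
    while len(centers) < k:
        # Squared distance from each point to its nearest center;
        # already-chosen centers get weight 0 and cannot repeat.
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Two well-separated clusters: seeding picks distinct centers,
# heavily favoring one from each cluster.
points = [0.0, 0.1, 0.2, 10.0, 10.1]
centers = kmeans_pp_init(points, 2)
assert len(set(centers)) == 2 and all(c in points for c in centers)
```

This is the textbook seeding scheme; the paper's contribution is accelerating it with geometric information, which this sketch does not attempt.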


Effectiveness and predictability of in-network storage cache for scientific workflows

Sim, Caitlin, Wu, Kesheng, Sim, Alex, Monga, Inder, Guok, Chin, Wurthwein, Frank, Davila, Diego, Newman, Harvey, Balcas, Justas

arXiv.org Artificial Intelligence

Large scientific collaborations often have multiple scientists accessing the same set of files while doing different analyses, creating repeated accesses to large amounts of shared data located far away. These data accesses have long latency due to distance and occupy the limited bandwidth available over the wide-area network. To reduce the wide-area network traffic and the data access latency, regional data storage caches have been installed as a new networking service. To study the effectiveness of such a cache system in scientific applications, we examine the Southern California Petabyte Scale Cache for a high-energy physics experiment. By examining about 3TB of operational logs, we show that this cache removed 67.6% of file requests from the wide-area network and reduced the traffic volume on the wide-area network by 12.3TB (or 35.4%) on an average day. The reduction in the traffic volume (35.4%) is less than the reduction in file counts (67.6%) because the larger files are less likely to be reused. Due to this difference in data access patterns, the cache system has implemented a policy to avoid evicting smaller files when processing larger files. We also build a machine learning model to study the predictability of the cache behavior. Tests show that this model is able to accurately predict the cache accesses, cache misses, and network throughput, making the model useful for future studies on resource provisioning and planning.
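The gap between the request-level reduction (67.6%) and the byte-level reduction (35.4%) follows directly from how the two hit rates are computed when large files are reused less often. A toy illustration with hypothetical numbers:

```python
def hit_rates(log):
    """log: list of (size_bytes, served_from_cache) access records.
    Returns (request-level hit rate, byte-level hit rate)."""
    hits = sum(1 for _, cached in log if cached)
    hit_bytes = sum(size for size, cached in log if cached)
    total_bytes = sum(size for size, _ in log)
    return hits / len(log), hit_bytes / total_bytes

# Toy log: many small reused files, two large files read only once.
log = [(1, True)] * 6 + [(1, False)] * 2 + [(100, False)] * 2
req_rate, byte_rate = hit_rates(log)
assert req_rate == 0.6   # 6 of 10 requests served from cache
assert byte_rate < 0.1   # but only 6 of 208 bytes
```

This is why a size-aware eviction policy, like the one described above that protects smaller files, can raise the request hit rate without a proportional gain in byte traffic saved.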


MUSTACHE: Multi-Step-Ahead Predictions for Cache Eviction

Tolomei, Gabriele, Takanen, Lorenzo, Pinelli, Fabio

arXiv.org Artificial Intelligence

In this work, we propose MUSTACHE, a new page cache replacement algorithm whose logic is learned from observed memory access requests rather than fixed like existing policies. We formulate the page request prediction problem as a categorical time series forecasting task. Then, our method queries the learned page request forecaster to obtain the next $k$ predicted page memory references to better approximate the optimal Bélády's replacement algorithm. We implement several forecasting techniques using advanced deep learning architectures and integrate the best-performing one into an existing open-source cache simulator. Experiments run on benchmark datasets show that MUSTACHE outperforms the best page replacement heuristic (i.e., exact LRU), improving the cache hit ratio by 1.9% and reducing the number of reads/writes required to handle cache misses by 18.4% and 10.3%.
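The eviction step described in the abstract can be sketched directly: given only the next $k$ predicted references, evict the cached page that is predicted to be needed latest, treating pages absent from the window as farthest away. The forecaster itself is a learned model in the paper; here its output is simply given, and the LRU tie-break is an assumption of this sketch:

```python
def evict_with_lookahead(cache, window, last_used):
    """Approximate Bélády's rule using only the next-k predicted
    references (`window`). Pages not predicted in the window count
    as farthest away; ties fall back to least-recent use, where
    last_used[p] is the time page p was last accessed."""
    position = {}
    for i, page in enumerate(window):
        position.setdefault(page, i)  # first predicted occurrence
    return max(cache, key=lambda p: (position.get(p, len(window)),
                                     -last_used[p]))

cache = {1, 2, 3}
# Forecaster predicts the next 3 references will be pages 3, 1, 3:
# page 2 is not predicted at all, so it is the victim.
assert evict_with_lookahead(cache, [3, 1, 3], {1: 2, 2: 1, 3: 0}) == 2
```

With a perfect forecaster and a long enough window this converges to Bélády's optimal policy, which is the sense in which MUSTACHE "better approximates" it.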